Table of contents

  • Logs for the methylation data app
    • Introduction
      • 21-11-2024
    • Loading in the data
      • 21-11-2024
    • Data exploration
      • 22-11-2024
      • 25-11-2024
    • BED file annotation
      • 26-11-2024
    • 28-11-2024
      • 03-12-2024
      • 04-12-2024
      • 09-12-2024
      • 10-12-2024

Logs for the methylation data app

Introduction

21-11-2024

This logbook describes the process of developing ideas and visualisations for an application aimed at research students. The application takes DNA methylation data as input. It should make it easier for the students to explore the data they generated, and help them understand it.

Loading in the data

21-11-2024

I would like to combine the data from all the files into one single data frame, with a column that identifies the sample each row comes from. This way I can compare the different conditions to each other.

The first code block loads the libraries that are used.

In [1]:
import os
import seaborn as sns
import matplotlib.pyplot as plt
import polars as pl
import numpy as np
import datashader as ds
import datashader.transfer_functions as tf
from Bio import SeqIO
import hvplot.polars
import pandas as pd
import re
import plotly.express as px
In [2]:
barcodes_names: pl.DataFrame = pl.read_csv("/commons/Themas/Thema06/Methylatie/barcodes.csv")

barcodes_names = barcodes_names.with_columns(controle_n = pl.int_range(pl.len()).over(" description")+1)
barcodes_names = barcodes_names.with_columns(group_and_n = pl.concat_str([pl.col(' description'), pl.col("controle_n")]))

barcodes_names = barcodes_names.with_columns(pl.col(pl.Utf8).str.strip_chars()).drop("controle_n")

print(barcodes_names)
shape: (5, 3)
┌─────────┬─────────────────────┬──────────────────────┐
│ barcode ┆  description        ┆ group_and_n          │
│ ---     ┆ ---                 ┆ ---                  │
│ i64     ┆ str                 ┆ str                  │
╞═════════╪═════════════════════╪══════════════════════╡
│ 11      ┆ Jurkat_DMSO_control ┆ Jurkat_DMSO_control1 │
│ 12      ┆ Jurkat_betuline     ┆ Jurkat_betuline1     │
│ 13      ┆ Healthy_control     ┆ Healthy_control1     │
│ 14      ┆ Jurkat_betuline     ┆ Jurkat_betuline2     │
│ 15      ┆ Jurkat_DMSO_control ┆ Jurkat_DMSO_control2 │
└─────────┴─────────────────────┴──────────────────────┘

This generates a data frame that contains each barcode and the description of that barcode. The column called group_and_n contains the description followed by a replicate number within that group.

This label is needed to tell the different groups apart in the df that will contain all of the data, which is loaded in the code block below.

In [3]:
path: str = "/commons/Themas/Thema06/Methylatie/analysis"
def load_files(path: str) -> pl.DataFrame:
    # Empty frame with the expected columns; every loaded file is stacked onto it.
    resulting_df: pl.DataFrame = pl.DataFrame(
        {"chr":[],
         "start":[],
         "end":[],
         "frac":[],
         "valid":[],
         "group_name":[]}
    )
    files: list[str] = os.listdir(path)

    for file in files:
        if os.path.isfile(f"{path}/{file}") and file.endswith("methylatie_ALL.csv"):
            temp_df: pl.DataFrame = pl.from_pandas(pd.read_csv(f"{path}/{file}", sep="\t"))
            # The first number in the file name is the barcode of the sample.
            barcode_num: list[str] = re.findall(r"\d+", file)

            # .item() turns the single matching cell into a plain string label.
            name_group: str = (barcodes_names
                               .filter(pl.col("barcode").cast(pl.String) == barcode_num[0])
                               .select("group_and_n")
                               .item())
            temp_df = temp_df.with_columns(pl.lit(name_group).alias("group_name"))
            resulting_df = pl.concat([temp_df, resulting_df])

    return resulting_df

df: pl.DataFrame = load_files(path=path)

All of the csv files are now loaded into one polars data frame.
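The link between a file and its group label runs through the barcode number in the file name. The sketch below shows that lookup in isolation; the file names and the labels dictionary are hypothetical, since the real file names are not shown in this logbook:

```python
import re
from typing import Optional

# Hypothetical barcode -> label table and file names, only for illustration.
labels = {"11": "Jurkat_DMSO_control1", "12": "Jurkat_betuline1"}
file_names = ["barcode11_methylatie_ALL.csv", "barcode12_methylatie_ALL.csv", "notes.txt"]

def label_for(file_name: str) -> Optional[str]:
    """Return the group label for a result file, or None if it is not a result file."""
    if not file_name.endswith("methylatie_ALL.csv"):
        return None
    match = re.search(r"\d+", file_name)  # first run of digits is taken as the barcode
    return labels.get(match.group()) if match else None

result = {name: label_for(name) for name in file_names}
print(result)
```

Returning None for unknown files or barcodes makes it possible to skip them explicitly instead of failing halfway through loading.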

In [4]:
print(df.head())
shape: (5, 6)
┌──────┬───────┬───────┬──────┬───────┬──────────────────┐
│ chr  ┆ start ┆ end   ┆ frac ┆ valid ┆ group_name       │
│ ---  ┆ ---   ┆ ---   ┆ ---  ┆ ---   ┆ ---              │
│ str  ┆ i64   ┆ i64   ┆ f64  ┆ i64   ┆ str              │
╞══════╪═══════╪═══════╪══════╪═══════╪══════════════════╡
│ chr1 ┆ 10468 ┆ 10469 ┆ 1.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr1 ┆ 10470 ┆ 10471 ┆ 1.0  ┆ 2     ┆ Jurkat_betuline2 │
│ chr1 ┆ 10488 ┆ 10489 ┆ 1.0  ┆ 2     ┆ Jurkat_betuline2 │
│ chr1 ┆ 10492 ┆ 10493 ┆ 1.0  ┆ 2     ┆ Jurkat_betuline2 │
│ chr1 ┆ 10496 ┆ 10497 ┆ 1.0  ┆ 1     ┆ Jurkat_betuline2 │
└──────┴───────┴───────┴──────┴───────┴──────────────────┘

Data exploration

22-11-2024

I now have a data frame with the methylation data and a column called group_name that holds the name of the group the data comes from.

In [5]:
test: pl.DataFrame = df.filter(
    pl.col("group_name").is_in(['Healthy_control1', 'Jurkat_betuline1', 'Jurkat_betuline2'])
)
# Narrow the group selection further (filtering df again would discard the group filter).
test = test.filter((pl.col("start") >= 60778131) &
                   (pl.col("end") <= 60778731) &
                   (pl.col("chr") == "chr10"))
print(test)
shape: (83, 6)
┌───────┬──────────┬──────────┬──────┬───────┬──────────────────┐
│ chr   ┆ start    ┆ end      ┆ frac ┆ valid ┆ group_name       │
│ ---   ┆ ---      ┆ ---      ┆ ---  ┆ ---   ┆ ---              │
│ str   ┆ i64      ┆ i64      ┆ f64  ┆ i64   ┆ str              │
╞═══════╪══════════╪══════════╪══════╪═══════╪══════════════════╡
│ chr10 ┆ 60778212 ┆ 60778213 ┆ 0.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60778217 ┆ 60778218 ┆ 0.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60778237 ┆ 60778238 ┆ 0.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60778258 ┆ 60778259 ┆ 0.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60778283 ┆ 60778284 ┆ 0.0  ┆ 1     ┆ Jurkat_betuline2 │
│ …     ┆ …        ┆ …        ┆ …    ┆ …     ┆ …                │
│ chr10 ┆ 60778682 ┆ 60778683 ┆ 0.0  ┆ 3     ┆ Healthy_control1 │
│ chr10 ┆ 60778689 ┆ 60778690 ┆ 0.0  ┆ 3     ┆ Healthy_control1 │
│ chr10 ┆ 60778700 ┆ 60778701 ┆ 0.0  ┆ 3     ┆ Healthy_control1 │
│ chr10 ┆ 60778707 ┆ 60778708 ┆ 0.0  ┆ 3     ┆ Healthy_control1 │
│ chr10 ┆ 60778723 ┆ 60778724 ┆ 0.0  ┆ 3     ┆ Healthy_control1 │
└───────┴──────────┴──────────┴──────┴───────┴──────────────────┘

This df contains all methylation data on chr10 between positions 60778131 and 60778731 (CDK1), for the 3 selected groups. This is a possible way to select all the methylation data in a range.

In [6]:
all_groups = pl.DataFrame({"group_name": df["group_name"].unique()})

test_agg: pl.DataFrame = (
    test
    .select(["group_name", "frac"])
    .group_by("group_name")
    .agg([pl.len().alias("n methylations")])
    .join(all_groups, on="group_name", how="full")
    .with_columns(pl.col("group_name").fill_null(pl.col("group_name_right")))
    .drop("group_name_right")
    .fill_null(0)
)
print(test_agg)
shape: (5, 2)
┌──────────────────────┬────────────────┐
│ group_name           ┆ n methylations │
│ ---                  ┆ ---            │
│ str                  ┆ u32            │
╞══════════════════════╪════════════════╡
│ Jurkat_betuline1     ┆ 0              │
│ Jurkat_betuline2     ┆ 39             │
│ Jurkat_DMSO_control1 ┆ 0              │
│ Jurkat_DMSO_control2 ┆ 0              │
│ Healthy_control1     ┆ 44             │
└──────────────────────┴────────────────┘

This df contains the number of methylation records per group in the range specified for the test df. Every row is counted, including rows with a frac of 0.0. Only the healthy control group and one of the two betuline groups have data in this range. These results agree with the results the students found for this gene (CDK1).

In [7]:
sns.set_theme()

sns.barplot(data = test_agg.sort("n methylations", descending=True),
            y = "group_name", x = "n methylations",
            hue="group_name", palette="Set2")
plt.show()
[Figure: bar plot of n methylations per group for the CDK1 region]

This plot shows the number of methylations per group for a certain gene (CDK1). Only 2 groups appear to have methylation data for this gene.

These results agree with the research and data processing that the research students have done.

25-11-2024

  • Extra md explanation for code above this block.

The next thing I would like to check is whether the difference between end and start is always 1. This is to check for possibly faulty data.

In [8]:
start_end_diff: pl.DataFrame = (
    df
    .filter(pl.col("end") - pl.col("start") != 1)
)
print(start_end_diff)
shape: (0, 6)
┌─────┬───────┬─────┬──────┬───────┬────────────┐
│ chr ┆ start ┆ end ┆ frac ┆ valid ┆ group_name │
│ --- ┆ ---   ┆ --- ┆ ---  ┆ ---   ┆ ---        │
│ str ┆ i64   ┆ i64 ┆ f64  ┆ i64   ┆ str        │
╞═════╪═══════╪═════╪══════╪═══════╪════════════╡
└─────┴───────┴─────┴──────┴───────┴────────────┘

This output means that the difference between end and start is always equal to 1, so there are no faulty positions.

I would like to see if there are any major differences between the groups in the amount of methylated DNA.

In [9]:
df_n_methylation: pl.DataFrame = (
    df
    .select("group_name")
    .group_by("group_name")
    .agg([pl.len().alias("n methylations")])
)
print(df_n_methylation.head())
shape: (5, 2)
┌──────────────────────┬────────────────┐
│ group_name           ┆ n methylations │
│ ---                  ┆ ---            │
│ str                  ┆ u32            │
╞══════════════════════╪════════════════╡
│ Jurkat_betuline1     ┆ 1994070        │
│ Healthy_control1     ┆ 14820163       │
│ Jurkat_DMSO_control1 ┆ 2541070        │
│ Jurkat_DMSO_control2 ┆ 2272843        │
│ Jurkat_betuline2     ┆ 2654900        │
└──────────────────────┴────────────────┘

This table holds the total number of methylations per group. Visualising this table would make it easier to see any differences.

In [10]:
sns.barplot(data = df_n_methylation,
            y = "group_name", x = "n methylations",
            hue="group_name", palette="Set2",).set(
                title="Amount of methylations for all groups",
                xlabel="Amount of methylations", ylabel = "Group name")
plt.show()
[Figure: bar plot titled "Amount of methylations for all groups"]

This plot visualises the number of methylations for every group that is part of the experiment. The x-axis holds the number of methylations, while the y-axis holds the name of the group that the number belongs to.

This plot clearly shows that the healthy control group has far more methylations than the other groups. This suggests that the treatments might have some sort of effect on the methylation. It is unclear whether this is caused by the betuline, because the DMSO control groups also appear to be affected.

I could zoom in on the other groups to visualise the differences between the treated groups.

In [11]:
sns.barplot(data = df_n_methylation.filter(pl.col("group_name") != "Healthy_control1"),
            y = "group_name", x = "n methylations",
            hue="group_name", palette="Set2",).set(
                title="Amount of methylations for all treated groups",
                xlabel="Amount of methylations", ylabel = "Group name")
plt.show()
[Figure: bar plot titled "Amount of methylations for all treated groups"]

This plot visualises the number of methylations for every treated group, leaving out the healthy control. The x-axis holds the number of methylations, while the y-axis holds the name of the group that the number belongs to.

This shows that there does not appear to be a pattern between or within groups. To make data selection easier I need to do the following:

  • Add gene symbols to the given BED file, to make it easier to look for specific genes

I will clean the BED file first, because it is very inconsistent in its use of tabs and spaces.
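The clean-up comes down to collapsing every run of whitespace into a single tab. A minimal sketch on invented lines (the raw_lines values and the helper name clean_bed_line are made up for illustration):

```python
import re

# Made-up lines that mix tabs and runs of spaces, plus one blank line.
raw_lines = ["chr1\t24735   33737\n", "chr1  131124\t139563 \n", "\n"]

def clean_bed_line(line: str) -> str:
    """Collapse every run of whitespace into a single tab."""
    return re.sub(r"\s+", "\t", line.strip())

# Skip blank lines so they do not end up as empty rows in the cleaned file.
cleaned = [clean_bed_line(line) for line in raw_lines if line.strip()]
print(cleaned)
```

Stripping each line first avoids a leading or trailing tab, which would otherwise show up as an extra empty column when the file is read back in.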

BED file annotation

26-11-2024

In [12]:
path_bed = "/commons/Themas/Thema06/Methylatie/RRMS_human_hg38.bed"
file_cleaned = []
with open(path_bed, 'r') as bed_file:
    for line in bed_file:
        replaced_line = re.sub(r"\s+", "\t", line.strip())

        file_cleaned.append(replaced_line)
with open("../data/new_bed_file.bed", "w") as new_bed:
    new_bed.write("chr\tstart\tend\n")
    new_bed.write("\n".join(file_cleaned))
In [13]:
bed_df = pl.from_pandas(pd.read_csv("../data/new_bed_file.bed", sep="\t"))
print(bed_df)
shape: (18_069, 3)
┌───────┬──────────┬──────────┐
│ chr   ┆ start    ┆ end      │
│ ---   ┆ ---      ┆ ---      │
│ str   ┆ i64      ┆ i64      │
╞═══════╪══════════╪══════════╡
│ chr1  ┆ 24735    ┆ 33737    │
│ chr1  ┆ 131124   ┆ 139563   │
│ chr1  ┆ 195251   ┆ 204121   │
│ chr1  ┆ 364792   ┆ 386185   │
│ chr1  ┆ 487107   ┆ 495546   │
│ …     ┆ …        ┆ …        │
│ chr18 ┆ 59898996 ┆ 59900196 │
│ chr19 ┆ 47220224 ┆ 47221024 │
│ chr11 ┆ 798884   ┆ 799484   │
│ chr10 ┆ 60778131 ┆ 60778731 │
│ chr17 ┆ 7667421  ┆ 7668621  │
└───────┴──────────┴──────────┘
In [14]:
biomart_bed = pl.from_pandas(pd.read_csv("/homes/rreilman/Downloads/mart_export.txt", low_memory=False))
biomart_bed = biomart_bed.select(["Chromosome/scaffold name", "Gene start (bp)", "Gene end (bp)", "Gene name"])
biomart_bed = biomart_bed.rename({
    "Chromosome/scaffold name": "chr",
    "Gene start (bp)": "start",
    "Gene end (bp)": "end",
    "Gene name": "gene_name"
})
biomart_bed = biomart_bed.with_columns(
    pl.col("gene_name").fill_null("unknown gene")
)
print(biomart_bed.head())
shape: (5, 4)
┌─────┬───────┬──────┬───────────┐
│ chr ┆ start ┆ end  ┆ gene_name │
│ --- ┆ ---   ┆ ---  ┆ ---       │
│ str ┆ i64   ┆ i64  ┆ str       │
╞═════╪═══════╪══════╪═══════════╡
│ MT  ┆ 577   ┆ 647  ┆ MT-TF     │
│ MT  ┆ 648   ┆ 1601 ┆ MT-RNR1   │
│ MT  ┆ 1602  ┆ 1670 ┆ MT-TV     │
│ MT  ┆ 1671  ┆ 3229 ┆ MT-RNR2   │
│ MT  ┆ 3230  ┆ 3304 ┆ MT-TL1    │
└─────┴───────┴──────┴───────────┘
In [15]:
def annotate_bed(bed_df: pl.DataFrame, annotate_df: pl.DataFrame):
    new_df = []
    for promoter in bed_df.iter_rows():
        chr_promoter, start_promoter, end_promoter = promoter

        overlaps = (annotate_df
                    .filter(
                        (pl.col("chr") == chr_promoter.replace("chr", "")) &
                        (pl.col("start") <= end_promoter) &
                        (pl.col("end") >= start_promoter)
                        )
                    )
        if not overlaps.is_empty():
            for gene in overlaps["gene_name"].to_list():
                new_df.append({"chr":chr_promoter,
                           "start":start_promoter,
                           "end":end_promoter,
                          "gene_name":gene})

        else:
            new_df.append({"chr":chr_promoter,
                           "start":start_promoter,
                           "end":end_promoter,
                          "gene_name":"Unknown gene"})
            
    
    
    return pl.DataFrame(new_df)
#bed_new_df = annotate_bed(bed_df, biomart_bed)

The BED file should now contain the promoter locations with the matching genes. Let's test this by searching for a gene the students used for their research.
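The annotation hinges on one overlap test: a gene is attached to a promoter region when the gene starts at or before the region's end and ends at or after the region's start. A minimal sketch of that test on made-up coordinates (the function name overlaps is invented for illustration):

```python
def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True when the closed intervals [a_start, a_end] and [b_start, b_end] share a position."""
    return a_start <= b_end and a_end >= b_start

checks = [
    overlaps(100, 200, 150, 300),  # partial overlap
    overlaps(100, 200, 200, 300),  # intervals only touch at position 200
    overlaps(100, 200, 201, 300),  # no shared position
]
print(checks)
```

BED coordinates are 0-based and half-open, so the touching case is slightly permissive for BED input. Using strict comparisons (< and >) would exclude regions that only touch.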

In [16]:
#print(bed_new_df.filter(pl.col("gene_name") == "CDK1"))

This result shows that no CDK1 is found in the promoter BED file, which cannot be right considering the students used it. Martijn told me about annotation from a GBFF file; I'm going to look into that.

28-11-2024

I'm going to look at annotating my BED file via a GBFF file from NCBI.

In [17]:
gbff_file = "/homes/rreilman/jaar2/ncbi_dataset/data/GCF_000001405.26/genomic.gbff"
gene_information = []
for record in SeqIO.parse(gbff_file, "genbank"):
    chr_name = record.id
    for feature in record.features:
        if feature.type == "gene":
            start = int(feature.location.start)
            end = int(feature.location.end)
            gene_name = feature.qualifiers.get("gene", ["Unknown"])[0]
            gene_information.append({"chr": chr_name,
                                     "start":max(0, start-1000),
                                     "end":end,
                                     "gene_name":gene_name})
gbff_gene_df = pl.DataFrame(gene_information)
print(gbff_gene_df.filter(pl.col("gene_name") == "CDK1"))
shape: (1, 4)
┌──────────────┬──────────┬──────────┬───────────┐
│ chr          ┆ start    ┆ end      ┆ gene_name │
│ ---          ┆ ---      ┆ ---      ┆ ---       │
│ str          ┆ i64      ┆ i64      ┆ str       │
╞══════════════╪══════════╪══════════╪═══════════╡
│ NC_000010.11 ┆ 60771975 ┆ 60794852 ┆ CDK1      │
└──────────────┴──────────┴──────────┴───────────┘

This df contains the chromosome, start, end and name of every gene in the GBFF file, with each start moved 1000 bp earlier (but never below 0). The chromosome naming convention does not match the one used in our BED file (chr*), so I will have to change that.
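Subtracting 1000 from the start extends the range upstream only for genes on the forward strand. For a gene on the reverse strand, the upstream side lies past the end coordinate. Below is a hedged sketch of a strand-aware variant, assuming the extra 1000 bp is meant to cover the promoter and that the strand comes from Biopython's feature.location.strand (1, -1 or None). The function name and the upstream parameter are made up for illustration:

```python
def extended_gene_range(start: int, end: int, strand, upstream: int = 1000) -> tuple[int, int]:
    """Extend a gene range on its upstream side, depending on the strand."""
    if strand == -1:
        # Reverse strand: upstream lies past the end coordinate.
        # The result is not clipped to the chromosome length here.
        return start, end + upstream
    # Forward strand (or unknown strand): upstream lies before the start.
    return max(0, start - upstream), end

forward = extended_gene_range(5000, 9000, 1)
reverse = extended_gene_range(5000, 9000, -1)
near_origin = extended_gene_range(400, 900, 1)
print(forward, reverse, near_origin)
```

Whether this matters for the dashboard depends on how the promoter regions in the BED file were defined, which is worth checking with Martijn.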

In [18]:
chrome_mapping = pl.from_pandas(pd.read_csv("/homes/rreilman/jaar2/chromosome_mapping.csv", delimiter="\t"))
print(chrome_mapping)
shape: (455, 2)
┌──────────────────────┬─────────────────┐
│ RefSeq seq accession ┆ Chromosome name │
│ ---                  ┆ ---             │
│ str                  ┆ str             │
╞══════════════════════╪═════════════════╡
│ NC_000001.11         ┆ 1               │
│ NC_000002.12         ┆ 2               │
│ NC_000003.12         ┆ 3               │
│ NC_000004.12         ┆ 4               │
│ NC_000005.10         ┆ 5               │
│ …                    ┆ …               │
│ NT_187685.1          ┆ 19              │
│ NT_187686.1          ┆ 19              │
│ NT_187687.1          ┆ 19              │
│ NT_113949.2          ┆ 19              │
│ NC_012920.1          ┆ MT              │
└──────────────────────┴─────────────────┘

This df maps the NCBI (RefSeq) accessions to the chromosome names I use.

In [19]:
gbff_gene_df_updated = gbff_gene_df.join(chrome_mapping, left_on="chr", right_on="RefSeq seq accession", how="left")
gbff_gene_df_updated = gbff_gene_df_updated.with_columns(
    pl.when(pl.col("Chromosome name").is_not_null())
    .then(pl.col("Chromosome name"))
    .otherwise(pl.col("chr"))
    .alias("chr")
).select(["chr", "start", "end", "gene_name"])
print(gbff_gene_df_updated.filter(pl.col("gene_name") == "CDK1"))
shape: (1, 4)
┌─────┬──────────┬──────────┬───────────┐
│ chr ┆ start    ┆ end      ┆ gene_name │
│ --- ┆ ---      ┆ ---      ┆ ---       │
│ str ┆ i64      ┆ i64      ┆ str       │
╞═════╪══════════╪══════════╪═══════════╡
│ 10  ┆ 60771975 ┆ 60794852 ┆ CDK1      │
└─────┴──────────┴──────────┴───────────┘

This data frame now uses our naming convention instead of the NCBI chromosome naming convention. The next step is to annotate our BED data frame.
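The renaming is a lookup with a fallback: accessions that appear in the mapping get their chromosome name, and anything else keeps its accession. A minimal sketch of that logic in plain Python, using two pairs that appear in the tables above and one hypothetical unmapped accession (NW_000000.1):

```python
# Two accession -> name pairs taken from the tables in this logbook.
refseq_to_chr = {"NC_000010.11": "10", "NC_012920.1": "MT"}

def to_chr_name(accession: str) -> str:
    """Return the chromosome name, or the accession itself when it is not mapped."""
    return refseq_to_chr.get(accession, accession)

def strip_chr_prefix(bed_name: str) -> str:
    """Turn a BED-style name such as 'chr10' into '10' so both conventions can be compared."""
    return bed_name[3:] if bed_name.startswith("chr") else bed_name

names = [to_chr_name(acc) for acc in ["NC_000010.11", "NC_012920.1", "NW_000000.1"]]
print(names, strip_chr_prefix("chr10"))
```

Stripping only a leading "chr" is a little stricter than replacing every occurrence of "chr" in the name, which could also alter names that contain "chr" elsewhere.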

In [20]:
annotated_bed_file = annotate_bed(bed_df, gbff_gene_df_updated)
print(annotated_bed_file)
#annotated_bed_file.write_csv("../data/annotated_bed.bed")
shape: (30_681, 4)
┌───────┬──────────┬──────────┬──────────────┐
│ chr   ┆ start    ┆ end      ┆ gene_name    │
│ ---   ┆ ---      ┆ ---      ┆ ---          │
│ str   ┆ i64      ┆ i64      ┆ str          │
╞═══════╪══════════╪══════════╪══════════════╡
│ chr1  ┆ 24735    ┆ 33737    ┆ WASH7P       │
│ chr1  ┆ 24735    ┆ 33737    ┆ MIR1302-2    │
│ chr1  ┆ 24735    ┆ 33737    ┆ FAM138A      │
│ chr1  ┆ 24735    ┆ 33737    ┆ LOC102724250 │
│ chr1  ┆ 24735    ┆ 33737    ┆ TRNAN-GUU    │
│ …     ┆ …        ┆ …        ┆ …            │
│ chr19 ┆ 47220224 ┆ 47221024 ┆ BBC3         │
│ chr11 ┆ 798884   ┆ 799484   ┆ PANO         │
│ chr11 ┆ 798884   ┆ 799484   ┆ PIDD         │
│ chr10 ┆ 60778131 ┆ 60778731 ┆ CDK1         │
│ chr17 ┆ 7667421  ┆ 7668621  ┆ TP53         │
└───────┴──────────┴──────────┴──────────────┘

I will now create the same bar plot as before, to check if the result is the same.

In [21]:
def get_gene_info(genes_list: list[str], annotated_bed_df):
    df_wanted = (annotated_bed_df
                 .filter(pl.col("gene_name").is_in(genes_list)))
    return df_wanted
df_cdk1 = get_gene_info(["CDK1"], annotated_bed_file)
print(df_cdk1)
shape: (2, 4)
┌───────┬──────────┬──────────┬───────────┐
│ chr   ┆ start    ┆ end      ┆ gene_name │
│ ---   ┆ ---      ┆ ---      ┆ ---       │
│ str   ┆ i64      ┆ i64      ┆ str       │
╞═══════╪══════════╪══════════╪═══════════╡
│ chr10 ┆ 60774212 ┆ 60783104 ┆ CDK1      │
│ chr10 ┆ 60778131 ┆ 60778731 ┆ CDK1      │
└───────┴──────────┴──────────┴───────────┘

This data frame contains the promoter areas for the CDK1 gene. I will use these to filter the df with the methylation data and create a bar plot.

In [22]:
methylation_cdk1 = (df
                    .filter((pl.col("chr").is_in(df_cdk1.select("chr"))) &
                            (pl.col("start") >= df_cdk1.select("start").to_numpy()[0]) &
                            (pl.col("end") <= df_cdk1.select("end").to_numpy()[0])))
print(methylation_cdk1)



test_agg2: pl.DataFrame = (
    methylation_cdk1
    .select(["group_name", "frac"])
    .group_by("group_name")
    .agg([pl.len().alias("n methylations")])
    .join(all_groups, on="group_name", how="full")
    .with_columns(pl.col("group_name").fill_null(pl.col("group_name_right")))
    .drop("group_name_right") 
    .fill_null(0)
)
print(test_agg2.head())

sns.barplot(data = test_agg2.sort("n methylations", descending=True),
            y = "group_name", x = "n methylations",
            hue="group_name", palette="Set2")
plt.show()
shape: (248, 6)
┌───────┬──────────┬──────────┬──────┬───────┬──────────────────┐
│ chr   ┆ start    ┆ end      ┆ frac ┆ valid ┆ group_name       │
│ ---   ┆ ---      ┆ ---      ┆ ---  ┆ ---   ┆ ---              │
│ str   ┆ i64      ┆ i64      ┆ f64  ┆ i64   ┆ str              │
╞═══════╪══════════╪══════════╪══════╪═══════╪══════════════════╡
│ chr10 ┆ 60774589 ┆ 60774590 ┆ 1.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60774731 ┆ 60774732 ┆ 1.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60775086 ┆ 60775087 ┆ 1.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60775117 ┆ 60775118 ┆ 1.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60775225 ┆ 60775226 ┆ 1.0  ┆ 1     ┆ Jurkat_betuline2 │
│ …     ┆ …        ┆ …        ┆ …    ┆ …     ┆ …                │
│ chr10 ┆ 60780807 ┆ 60780808 ┆ 1.0  ┆ 2     ┆ Healthy_control1 │
│ chr10 ┆ 60780818 ┆ 60780819 ┆ 1.0  ┆ 2     ┆ Healthy_control1 │
│ chr10 ┆ 60781020 ┆ 60781021 ┆ 1.0  ┆ 3     ┆ Healthy_control1 │
│ chr10 ┆ 60781482 ┆ 60781483 ┆ 1.0  ┆ 4     ┆ Healthy_control1 │
│ chr10 ┆ 60781532 ┆ 60781533 ┆ 1.0  ┆ 6     ┆ Healthy_control1 │
└───────┴──────────┴──────────┴──────┴───────┴──────────────────┘
shape: (5, 2)
┌──────────────────────┬────────────────┐
│ group_name           ┆ n methylations │
│ ---                  ┆ ---            │
│ str                  ┆ u32            │
╞══════════════════════╪════════════════╡
│ Jurkat_betuline1     ┆ 0              │
│ Jurkat_betuline2     ┆ 123            │
│ Jurkat_DMSO_control1 ┆ 0              │
│ Jurkat_DMSO_control2 ┆ 0              │
│ Healthy_control1     ┆ 125            │
└──────────────────────┴────────────────┘
[Figure: bar plot of n methylations per group for the annotated CDK1 region]

This gives a result similar to the first bar plot. The only difference is that the annotation added an extra promoter area that has been linked to the CDK1 gene. This is usable for now, but I'll have to ask Martijn if it is correct. The next step will be cleaning up some of my code.

The way the main df is being filtered is currently incorrect, because only the first promoter row is used. I will have to create a function for this to make it more usable.

03-12-2024

I noticed that I copied code to select data, so I am going to create a function for this to make it easier. This function takes a promoter BED data frame with annotated gene names, filtered on the genes of interest. The function uses that data frame to find all methylation data in those promoter regions.

In [23]:
def filter_data_from_bed(main_df, promoter_df):
    final_subsetted_df: pl.DataFrame = pl.DataFrame(
        {"chr":[],
         "start":[],
         "end":[],
         "frac":[],
         "valid":[],
         "group_name":[]}
    )
    for row in promoter_df.iter_rows():
        (chromosome, promoter_start, promoter_end, gene_name) = row
        subsetted_df = main_df.filter(
            (pl.col("chr") == chromosome) &
            (pl.col("start") >= promoter_start) &
            (pl.col("end") <= promoter_end)
        )
        final_subsetted_df = pl.concat([subsetted_df, final_subsetted_df])
        
    return final_subsetted_df
cdk1_test_df = filter_data_from_bed(df, df_cdk1)
print(cdk1_test_df.head())
shape: (5, 6)
┌───────┬──────────┬──────────┬──────┬───────┬──────────────────┐
│ chr   ┆ start    ┆ end      ┆ frac ┆ valid ┆ group_name       │
│ ---   ┆ ---      ┆ ---      ┆ ---  ┆ ---   ┆ ---              │
│ str   ┆ i64      ┆ i64      ┆ f64  ┆ i64   ┆ str              │
╞═══════╪══════════╪══════════╪══════╪═══════╪══════════════════╡
│ chr10 ┆ 60778212 ┆ 60778213 ┆ 0.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60778217 ┆ 60778218 ┆ 0.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60778237 ┆ 60778238 ┆ 0.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60778258 ┆ 60778259 ┆ 0.0  ┆ 1     ┆ Jurkat_betuline2 │
│ chr10 ┆ 60778283 ┆ 60778284 ┆ 0.0  ┆ 1     ┆ Jurkat_betuline2 │
└───────┴──────────┴──────────┴──────┴───────┴──────────────────┘

04-12-2024

To see if this works, I will create a function that counts the amount of methylation data per group.

In [24]:
def count_methylation_data(main_df):
    return (main_df
    .select(["group_name", "frac"])
    .group_by("group_name")
    .agg([pl.len().alias("n methylations")])
    .join(all_groups, on="group_name", how="full")
    .with_columns(pl.col("group_name").fill_null(pl.col("group_name_right")))
    .drop("group_name_right") 
    .fill_null(0)
)
    
cdk1_count_data = count_methylation_data(cdk1_test_df)
cdk1_count_data
Out[24]:
shape: (5, 2)
┌──────────────────────┬────────────────┐
│ group_name           ┆ n methylations │
│ ---                  ┆ ---            │
│ str                  ┆ u32            │
╞══════════════════════╪════════════════╡
│ Jurkat_betuline1     ┆ 0              │
│ Jurkat_betuline2     ┆ 162            │
│ Jurkat_DMSO_control1 ┆ 0              │
│ Jurkat_DMSO_control2 ┆ 0              │
│ Healthy_control1     ┆ 169            │
└──────────────────────┴────────────────┘

This table can be used as data for a bar plot, to show how many methylations every group has for the selected gene(s). I'm going to create a function that plots this data as a bar plot.

In [25]:
def code_count_data(data_df):
    sns.barplot(data = data_df.sort("n methylations", descending=True),
            y = "group_name", x = "n methylations",
            hue="group_name", palette="Set2")
    plt.show()
code_count_data(cdk1_count_data)
[Figure: bar plot of n methylations per group for both CDK1 promoter regions]

What this shows is that multiple promoter sites were found for CDK1, sites that possibly weren't found by Martijn.

In [26]:
fig = px.bar(cdk1_count_data, y="n methylations", x="group_name", color="group_name",
             title="Amount of methylations for the CDK1 gene")
fig.show()

09-12-2024

hvPlot can be used to create highly customisable interactive plots, which I will use for the dashboard.

In [27]:
def plot_barchart(df: pl.DataFrame):
    barplot = df.hvplot.bar(x = "group_name", y="n methylations",
                  color = "group_name", cmap = "Category10", width = 900)
    return barplot
    
plot_barchart(cdk1_count_data)
Out[27]:

This function generates an interactive plot, which is well suited to a website.

10-12-2024

Now I have to see what other plots I can use for visualisation. A scatter plot might be usable to plot the specific methylation positions. This will not be a fast plot with all of the data, so the user must filter on range, chr and group to speed it up.

In [28]:
def plot_scatter(df):
    df = (df.
          filter(pl.col("chr").is_in("chr"+chrome_mapping["Chromosome name"].unique())))
    df = df.sample(fraction=0.02)
    return df.hvplot.scatter(x = "start", y="chr", by="group_name", width = 900)

plot_scatter(df)
Out[28]:

At first glance this plot looks like a mess of grey, but once filtered on chr, groups and ranges it will be a lot more readable. Thanks to the interactivity of hvPlot the user will be able to zoom into the plot and look at more specific ranges of the genome. The colours differentiate the groups. I am not sure if it is usable; I will test it in the Panel site.

In [29]:
def plot_violin(df):
    return  df.hvplot.violin(y = "start", by='group_name', c="group_name", width = 900)
plot_violin(df)
Out[29]:

This plot does not really work. It also calculates values like the mean and IQR, which are meaningless for genomic locations.

In [30]:
def plot_kde(df):
    df = df.select(["group_name", "start"])
    return df.hvplot.kde(by="group_name")

plot_kde(df)
Out[30]:

This plot shows the density of methylation positions over all chromosomes combined. The df could be filtered on chromosome and range to get a more specific view.